Module 1
INTRODUCTION TO DATA AND DATA SCIENCE


Data and the Data Ecosystem 











Data Science and its Applications 







Data Science Roles and Tasks 





The Data Scientist 





The Data Science Workflow 





Data Collection and Storage 





Preparation, Exploration and Visualization 








Experimentation and Prediction 








Learning Outcomes 


* Define data, its properties, importance and capabilities
* Explain the drivers of data and the current data ecosystem
* Define Data Science and differentiate its applications
* Differentiate the Data Science roles and enumerate the tools needed by each role
* Explain the skills and characteristics that a Data Scientist must possess
* Explain the phases of the Analytics lifecycle and relate these to the Data Science workflow
* Identify, enumerate and explain the elements of the Data Science workflow

Data and the Data Ecosystem

* information collected for use
* information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer
* factual information (such as measurements or statistics) used as a basis for reasoning, discussion or calculation
* information in digital form that can be transmitted or processed
* information output by a sensing device or organ that includes both useful and irrelevant information and must be processed to be meaningful
* information in raw or unorganized form (such as alphabets, numbers, or symbols) that refers to, or represents, conditions, ideas or objects
* symbols or signals that are input, stored, and processed by a computer, for output as usable information
* a set of values of subjects with respect to qualitative or quantitative variables

Data becomes information when it is viewed in context or in post-analysis.

WHAT CAN DATA DO?

* describe the current state of an organization or process (e.g., energy consumption)
* detect anomalous events (e.g., fraudulent purchases)
* diagnose the causes of events and behaviors (e.g., Spotify or Netflix activity)
* predict future events (e.g., forecasting population size)
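
As a minimal sketch of describing, detecting and predicting, assume a made-up series of daily energy readings. The numbers, the two-standard-deviation rule and the straight-line forecast are illustrative assumptions, not course data or a recommended method.

```python
from statistics import mean, stdev

# Hypothetical daily energy consumption in kWh (illustrative values only).
readings = [30.1, 29.8, 31.0, 30.5, 55.2, 30.9, 31.4, 31.8]

# Describe: summarize the current state.
avg = mean(readings)
spread = stdev(readings)

# Detect: flag readings more than 2 standard deviations from the mean.
anomalies = [r for r in readings if abs(r - avg) > 2 * spread]

# Predict: fit a least-squares line through the non-anomalous points
# and extrapolate one day ahead.
clean = [(i, r) for i, r in enumerate(readings) if r not in anomalies]
x_bar = mean(i for i, _ in clean)
y_bar = mean(r for _, r in clean)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in clean)
         / sum((x - x_bar) ** 2 for x, _ in clean))
intercept = y_bar - slope * x_bar
forecast = intercept + slope * len(readings)

print(f"average={avg:.2f} anomalies={anomalies} forecast={forecast:.2f}")
```

Diagnosing causes usually needs additional context (what happened on the anomalous day), which is why it is not covered by this sketch.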

WHAT CAN DATA DO?

[Figure: sources of data: website data, mobile data, offline data, CRM data, 3rd-party data, purchase data]

* What data should be collected? 
* What methods are there for reasoning from data? 
* How do we get answers from the data to answer our most pressing questions about our businesses, our lives and 
our world? 

BIG DATA

Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. (McKinsey Global Report, 2011)

* Data at Rest: terabytes to exabytes of existing data to process
* Streaming data, requiring milliseconds to seconds to respond
* Data in Many Forms: structured, unstructured, text, multimedia, ...
* Data in Doubt: uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations
* Data into Money: business models can be associated to the data

SOURCES OF THE BIG DATA DELUGE

DATA STRUCTURES

Data growth is increasingly unstructured, with 80-90% of future data growth coming from unstructured data types.

[Figure: data structure types: Structured Data, Semi-Structured Data, Quasi-Structured Data, Unstructured Data]

DATA STRUCTURES

* Structured Data: contains a defined data type, format and structure

[Figure: example table, "SUMMER FOOD SERVICE PROGRAM"]

* Transaction data
* Online Analytical Processing (OLAP) cubes
* Traditional Relational Database Management Systems (RDBMS)
* Comma-Separated Value (CSV) files
* Simple spreadsheets
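
A small sketch of why structured data is easy to work with: every row shares one schema, so fields can be read by name and converted to a defined type. The column names and values below are hypothetical.

```python
import csv
import io

# Hypothetical transaction records in CSV form (illustrative values only).
raw = """transaction_id,amount,currency
1001,25.50,USD
1002,310.00,USD
1003,12.75,EUR
"""

# Each row follows the same schema, so every field has a known name and type.
rows = [
    {
        "transaction_id": int(r["transaction_id"]),
        "amount": float(r["amount"]),
        "currency": r["currency"],
    }
    for r in csv.DictReader(io.StringIO(raw))
]

total_usd = sum(r["amount"] for r in rows if r["currency"] == "USD")
print(total_usd)
```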

DATA STRUCTURES

* Semi-Structured Data: textual data files with a discernible pattern that enables parsing, such as Extensible Markup Language (XML) data files that are self-describing and defined by an XML schema
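
To illustrate "self-describing", here is a sketch that parses a tiny XML document with Python's standard library. The element names and values are made up for the example. Note that the second record simply omits a field, which a fixed-column table could not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the tags describe the data they contain.
doc = """<customers>
  <customer id="1"><name>Jane Doe</name><city>Springfield</city></customer>
  <customer id="2"><name>John Roe</name></customer>
</customers>"""

root = ET.fromstring(doc)
customers = [
    {
        "id": c.get("id"),
        "name": c.findtext("name"),
        "city": c.findtext("city"),  # None when the element is absent
    }
    for c in root.findall("customer")
]
print(customers)
```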

DATA STRUCTURES

* Quasi-Structured Data: textual data with erratic data formats that can be formatted with effort, tools and time

https://www.google.com/#q=EMC+data+science
https://education.emc.com/guest/campaign/data_science.aspx
https://education.emc.com/guest/certification/framework/stf/data_science.aspx
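
The "effort, tools and time" can be sketched with the clickstream URLs from this slide. The code below pulls a host, a path and any search terms out of each URL. Which fields are worth extracting is an assumption made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

clicks = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

records = []
for url in clicks:
    parts = urlsplit(url)
    # In the first URL the search terms sit in the fragment ("#q=..."),
    # not in the query string, so the fragment is parsed instead.
    terms = parse_qs(parts.fragment).get("q", [""])[0]
    records.append({"host": parts.netloc, "path": parts.path, "search": terms})

for record in records:
    print(record)
```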

BUSINESS DRIVERS FOR ADVANCED ANALYTICS

Business Driver | Examples
Optimize business operations | Sales, pricing, profitability, efficiency
Identify business risk | Customer churn, fraud, default
Predict new business opportunities | Upsell, cross-sell, best new customer prospects
Comply with laws or regulatory requirements | Anti-money laundering, fair lending

EVOLUTION OF ANALYTICS

DRIVERS OF BIG DATA

* Measured in terabytes (1 TB = 1,000 GB): RDBMS & data warehouse
* 2000s, measured in petabytes (1 PB = 1,000 TB): content & digital asset management
* 2010s, will be measured in exabytes (1 EB = 1,000 PB): NoSQL & key-value

* Medical information, such as genomic sequencing and diagnostic imaging
* Photos and video footage uploaded to the World Wide Web
* Mobile devices: geospatial location data, metadata about text messages, phone calls, application usage on smart phones
* Smart devices
* Nontraditional IT devices: RFID readers, GPS navigation systems, seismic processing

THE BIG DATA ECOSYSTEM

KEY ROLES IN THE BIG DATA ECOSYSTEM

* Deep Analytical Talents: possess the skills to handle raw, unstructured data and to apply complex analytical techniques at massive scales
  o statisticians, economists, mathematicians
* Data Savvy Professionals: possess the basic knowledge of statistics or machine learning and can define key questions that can be answered using advanced analytics
  o financial analysts, life scientists, operation managers, business managers
* Technology and Data Enablers: provide technical expertise to support analytical projects, such as provisioning and administrating analytical sandboxes, and managing large-scale data architectures that enable widespread analytics within companies and other organizations
  o computer engineers, programmers, database administrators

Data Science and its Applications

DATA -> KNOWLEDGE -> ACTION

Data Science is a set of methodologies for taking in thousands of forms of available data and using them to draw meaningful conclusions.

It represents the optimization of processes and resources to produce data insights: data-informed conclusions or predictions that can be used to understand (and improve) health, businesses and investments, lifestyles and social lives.

MACHINE LEARNING

Case study: fraud detection

[Table: transaction records with columns Amount, Date, Type, ...]

Machine Learning Requisites

* A well-defined question
  o What is the probability that this transaction is fraudulent?
* A set of example data
  o Old transactions labeled as “fraudulent” or “valid”
* A new set of data to use the algorithm on
  o New transactions
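
The three requisites can be sketched in a few lines. The example below assumes a single feature (transaction amount) and a hand-written logistic regression. The labeled history, the learning rate and the choice of model are all illustrative assumptions. A real fraud model would use many more features and a tested library.

```python
import math

# Example data (hypothetical): (amount in thousands, 1 = fraudulent, 0 = valid).
history = [(0.02, 0), (0.05, 0), (0.10, 0), (0.30, 0), (0.80, 0),
           (1.90, 1), (2.50, 1), (3.00, 1), (4.20, 1), (5.00, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, lr=0.1, epochs=5000):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = train(history)


def fraud_probability(amount):
    """The well-defined question: how likely is this transaction fraudulent?"""
    return sigmoid(w * amount + b)


# New data to use the fitted model on.
new_transactions = [0.04, 3.50]
scores = [fraud_probability(a) for a in new_transactions]
print(scores)
```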

INTERNET OF THINGS (IoT)

Case study: smart watch

* Gadgets that aren’t standard computers
  o Smart watches
  o Internet-connected home security systems
  o Electronic toll collection systems
  o Building energy management systems

DEEP LEARNING

Case study: image recognition

* Many neurons working together
* Requires more training data
* Used in complex problems
  o Image classification
  o Language learning / understanding

Data Science Roles and Tools

THE DATA SCIENCE WORKFLOW

Data Collection & Storage > Data Preparation > Exploration & Visualization > Experimentation & Prediction

Data Collection & Storage
* Surveys
* Web results
* Geo-tagged posts
* Financial transactions

Data Preparation
* Finding missing or duplicate values
* Converting data

Exploration & Visualization
* Building dashboards
* Comparing data

Experimentation & Prediction
* Building systems
* Validating systems
* Performing tests

DATA ENGINEER

* Information architects
* Control the flow of data
* Build data pipelines and storage solutions so that data is easily collected, obtained and processed

* SQL: to store and organize data
* Java, Scala or Python: programming languages to process data
* Shell: command line to automate and run tasks
* AWS, Azure, Google Cloud Platform: cloud computing to ingest and store large amounts of data

DATA ANALYST

* Perform simple analyses to describe data
* Explore and clean the data and create visualizations and dashboards to summarize data
* Describe the present via data

* SQL: to retrieve and aggregate data relevant to the analysis
* Spreadsheets (Excel or Google Sheets): to perform simple analyses on small quantities of data
* BI Tools (Tableau, Power BI, Looker): to create dashboards and share analyses
* Python, R: for cleaning and analyzing data

DATA SCIENTIST

* Find new insights from data through statistical analysis, rather than solely describing data
* Use traditional machine learning for prediction and forecasting

* SQL: to retrieve and aggregate data relevant to the analysis
* Python and/or R with associated libraries (pandas / tidyverse)

MACHINE LEARNING SCIENTIST

* Similar to data scientists, but with a machine learning specialization
* Go beyond machine learning with deep learning
* Strong focus on prediction

* R, Python: to create predictive models, with associated libraries, like TensorFlow to run powerful deep learning algorithms

* Data Engineer: store and maintain data (SQL + Java/Scala/Python)
* Data Analyst: visualize and describe data (SQL + BI Tools + Spreadsheets)
* Data Scientist: gain insights from data (Python/R)
* Machine Learning Scientist: predict with data (Python/R)

The Data Scientist

THE DATA SCIENTIST VS THE DATA ENGINEER

* DATA SCIENCE: the computational science of extracting meaningful insights from raw data and then effectively communicating those insights to generate value
* DATA ENGINEERING: the engineering domain that is dedicated to building and maintaining systems that overcome data processing bottlenecks and data handling problems

DATA SCIENTISTS’ SKILL SET AND BEHAVIORAL CHARACTERISTICS

The Data Science Workflow

THE ANALYTICS LIFECYCLE

[Figure: the analytics lifecycle, with checkpoint questions between phases]

* Do I have enough information to draft an analytic plan and share for peer review?
* Do I have enough good quality data to start building the model?
* Do I have a good idea about the type of model to try? Can I refine the analytic plan?
* Is the model robust enough?

THE ANALYTICS LIFECYCLE

* The team learns the business domain, including relevant history such as whether the unit has attempted similar projects in the past from which they can learn.
* The team assesses the resources available to support the project in terms of people, technology, time and data.
* The team frames the business problem as an analytics challenge that can be addressed in subsequent phases and formulates initial hypotheses (IHs) to test and begin learning the data.

* The team executes extract, load and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox so the team can work with it and analyze it.
* The team familiarizes itself with the data thoroughly and takes steps to condition it.

THE ANALYTICS LIFECYCLE

* The team determines the methods, techniques and workflow it intends to follow for the subsequent model building phase.
* The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

* The team develops datasets for testing, training and production purposes.
* The team builds and executes models based on the work done in the model planning phase.
* The team considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows.
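
Developing separate datasets for training and testing can be sketched as a shuffled split. The 80/20 ratio, the fixed seed and the function name `split_dataset` are assumptions chosen for illustration.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for 100 observations
train_set, test_set = split_dataset(records)
print(len(train_set), len(test_set))
```

Keeping the test set out of model fitting is what lets the team later judge whether the model appears valid on data it has not seen.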

THE ANALYTICS LIFECYCLE

* The team determines if the results of the project are a success or a failure based on the criteria developed in Phase 1.
* The team identifies key findings, quantifies the business value, and develops a narrative to summarize and convey findings to stakeholders.

* The team delivers final reports, briefings, code, and technical documents.
* The team may run a pilot project to implement the models in a production environment.

THE ANALYTICS LIFECYCLE: DISCOVERY

* Learning the business domain: How much business or domain knowledge is needed to develop models?
* Identifying resources: What technology, tools, systems, data, people are needed?
* Framing the Problem: What is the problem, the objectives and success and failure criteria?
* Identifying Key Stakeholders: Who will benefit from the project or will be significantly impacted by the project?
* Interviewing the Analytics Sponsor: Who has final decision-making authority on the project? How will the focus and scope of the problem change if the following dimensions change: time, people, risk, resources, size and attributes of data?

THE ANALYTICS LIFECYCLE: DISCOVERY

* Developing the initial hypotheses: What are the ideas that the team can test with data?
* Identifying potential data sources: What data (including volume, type and time span) will the team need to solve the problem?

Activities:
* Identify data sources: inventory of available and needed datasets
* Capture aggregate data sources
* Review the raw data: quality and limitations of data
* Evaluate the data structures and tools needed
* Scope the sort of data infrastructure needed for this type of problem: disk storage and network capacity

THE ANALYTICS LIFECYCLE: DATA PREPARATION

* Preparing the analytic sandbox: obtaining an analytic sandbox (or workspace) in which the team can explore the data without interfering with live production databases
* ETLT:
  o assessing data quality and structuring the datasets properly so they can be used for robust analysis
  o extracting, transforming, loading or extracting, loading, transforming data (advocated by analytic sandboxes to preserve raw data; a good practice is to make an inventory of the raw and current data available to datasets)

THE ANALYTICS LIFECYCLE: DATA PREPARATION

Learning About the Data
* understanding what constitutes a reasonable value and expected output versus what is a surprising finding
* cataloging the data sources that the team has access to and identifying additional sources that the team can leverage but does not have access to

Dataset inventory columns: Dataset | Data Available and Accessible | Data Available but Not Accessible | Data to Collect | Data to Obtain from 3rd Party Sources
Datasets: Product shipped, Product financials, Product call center data, Live product feedback surveys, Product sentiment from social media

THE ANALYTICS LIFECYCLE: DATA PREPARATION

Data Conditioning
* determining the data sources and how clean the data is
* determining to what degree the data contains missing or inconsistent values and if the data contains values that deviate from normal
* assessing the consistency of the data types
* reviewing data content
* looking for evidence of systematic error

Surveying and Visualization
* leveraging data visualization tools to gain an overview of the data; examining data quality for unexpected values or skewness

Guidelines:
* Ensure that calculations are consistent and that data distribution is consistent
* Assess the granularity and aggregation of the data and the range of values
* Determine whether the data is standardized / normalized
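
A minimal sketch of two of these checks follows: counting missing values, then rescaling a column by normalization and by standardization. The column values are hypothetical, and the population standard deviation is one possible choice among several.

```python
from statistics import mean, pstdev

values = [12.0, 15.0, None, 14.0, 90.0, 13.0]  # hypothetical column with a gap

# Conditioning check: how many values are missing?
missing = sum(v is None for v in values)
present = [v for v in values if v is not None]

# Min-max normalization rescales the values to the range [0, 1].
lo, hi = min(present), max(present)
normalized = [(v - lo) / (hi - lo) for v in present]

# Standardization (z-scores) rescales to mean 0 and standard deviation 1.
mu, sigma = mean(present), pstdev(present)
standardized = [(v - mu) / sigma for v in present]

print(missing, normalized, standardized)
```

The value 90.0 stands far from the rest in both rescaled versions, which is the kind of deviation from normal the conditioning step is meant to surface.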

THE ANALYTICS LIFECYCLE: MODEL PLANNING

Data Exploration and Variable Selection
* understanding the relationships among the variables to inform selection of the variables and methods and to understand the problem domain (a good way is to use tools for data visualization)
* testing a range of variables to include in the model and then focusing on the most important and influential variables

Model Selection
* choosing an analytical technique, or a short list of candidate techniques, based on the end goal of the project

THE ANALYTICS LIFECYCLE: MODEL BUILDING

Considerations

* Does the model appear valid and accurate on the test data?
* Does the model output / behavior make sense to the domain experts?
* Do the parameter values of the fitted model make sense in the context of the domain?
* Is the model sufficiently accurate to meet the goal?
* Does the model avoid intolerable mistakes (false positives and false negatives)?
* Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?
* Will the kind of model chosen support the runtime requirements?
* Is a different form of the model required to address the business problem?
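
The questions about accuracy and intolerable mistakes can be made concrete by counting false positives and false negatives on test data. The labels below are invented for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarm
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # missed case
    return tp, tn, fp, fn


actual = [1, 0, 0, 1, 0, 1, 0, 0]     # hypothetical test labels
predicted = [1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical model output

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(accuracy, false_positive_rate, false_negative_rate)
```

Which of the two error types is less tolerable depends on the business problem, so accuracy alone rarely answers the question.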

THE ANALYTICS LIFECYCLE: RESULTS COMMUNICATION

1. Record all findings; select the three most significant to share with the stakeholders
2. Reflect on the implications of the findings
3. Measure the business impact of the results and demonstrate the value of the findings
4. Make recommendations for future work or improvements to existing processes

THE ANALYTICS LIFECYCLE: OPERATIONALIZATION

1. Deploy the new analytical methods or models in a production environment (pilot level)
2. Create a mechanism for performing ongoing monitoring of model accuracy
3. Learn from the deployment and make the needed adjustments
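
One possible shape for such a monitoring mechanism is a rolling window over recent predictions. This is a sketch only: the class name `AccuracyMonitor`, the window size and the alert threshold are hypothetical choices.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_attention())
```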

KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT

Business User
* Understands the domain area and usually benefits from the results
* Consults and advises the project team on the context of the project, the value of the results and how the outputs will be operationalized

Project Sponsor
* Provides the requirements for the project and defines the core business problem
* Provides the funding and gauges the degree of value from the final outputs of the working team, sets priorities for the project and clarifies the desired outputs

Project Manager
* Ensures that key milestones and objectives are met on time and at the expected quality

Business Intelligence Analyst
* Provides business domain expertise based on an understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
* Creates dashboards and reports and has knowledge of the data feeds and sources

KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT

Database Administrator
* Provisions and configures the database environment to support the analytics needs of the working team
* Provides access to key databases or tables and ensures the appropriate security levels are in place related to the data repositories

Data Engineer
* Assists with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox
* Executes the actual data extractions and performs substantial data manipulation to facilitate the analytics

Data Scientist
* Provides subject matter expertise for analytical techniques, data modeling and applying valid analytical techniques to given business problems
* Ensures overall analytics objectives are met
* Designs and executes analytical methods and approaches with the data available to the project

ANALYTICS DELIVERABLES

* Presentation for Project Sponsors: contains high-level takeaways for executive-level stakeholders, with a few key messages to aid their decision-making process; contains clean, easy visuals for the viewer to grasp
* Presentation for Analysts: describes business process changes and reporting changes; contains details and technical graphs
* Code: for technical people
* Technical Specifications: for implementing the code

DS100: APPLIED DATA SCIENCE 


PROJECT STAKEHOLDERS’ OUTPUTS

* Business User: determines the benefits and implications of the findings to the business
* Project Sponsor: asks questions related to the business impact of the project, the risks and return on investment, and the way the project can be evangelized within the organization
* Project Manager: determines if the project was completed on time and within budget and how well the goals were met
* Business Intelligence Analyst: determines if the reports and dashboards they manage will be impacted and need to change

PROJECT STAKEHOLDERS’ OUTPUTS

* Database Administrator: needs to share their code from the analytics project and create a technical document on how to implement it
* Data Engineer: needs to share their code from the analytics project and create a technical document on how to implement it
* Data Scientist: needs to share the code and explain the model to their peers, managers and other stakeholders

Data Collection and Storage

DATA SOURCES

Company Data
* Web events
* Survey data
* Customer data
* Logistics data
* Financial transactions

Open Data
* Application Programming Interfaces
  o Twitter
  o Wikipedia
  o Google Maps
* Public Records
  o International Organizations (World Bank)
  o National Statistical Offices (surveys)
  o Government Agencies (weather, population)

DATA TYPES

Quantitative Data
* can be counted, measured and expressed using numbers
* examples: 60 inches tall, has 2 apples, costs $1,000

Qualitative Data
* can be observed but not counted
* descriptive and conceptual
* examples: red, made in Italy, smells like fish

DATA TYPES

Image Data
* made up of pixels that contain information about color and intensity

Text Data
* social media posts
* reviews
* emails
* documents

Geospatial Data
* data with location information (e.g., roads, buildings, vegetation)
* useful for navigation apps

Network Data
* shows relationships between nodes (people or things)

[Figure: geospatial layers such as street data and buildings data; a network graph of connected people]

DATA STORAGE AND RETRIEVAL

Location: Parallel Storage
* cluster or servers for storage and easy access to data

Location: Cloud Storage
* Microsoft Azure
* Amazon Web Services (AWS)
* Google Cloud

DATA STORAGE AND RETRIEVAL

Type: Document Database
* stores unstructured data (e.g., email, text, video and audio files, social media messages)

Type: Relational Database
* stores structured data

Customer Name | Customer Address
Jane Doe | 123 Maple St.

Retrieval: Data Query

Data Type | Query Language
Document Database | NoSQL
Relational Database | SQL
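
The difference in retrieval style can be sketched with Python's built-in `sqlite3` module for the relational case. The document case is only mimicked here with a list of dictionaries, because real document databases expose their own query interfaces. The table, field names and records are illustrative assumptions.

```python
import sqlite3

# Relational database: structured rows queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, address TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?)", ("Jane Doe", "123 Maple St."))
sql_result = conn.execute(
    "SELECT address FROM customers WHERE name = ?", ("Jane Doe",)
).fetchall()
conn.close()

# Document store stand-in: free-form records that need not share fields.
documents = [
    {"type": "email", "from": "jane@example.com", "body": "Order received"},
    {"type": "post", "text": "Great service!", "tags": ["review"]},
]
emails = [d for d in documents if d.get("type") == "email"]

print(sql_result, emails)
```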

Preparation, Exploration and Visualization

DATA PREPARATION

Why?
* Real-life data is messy
* To prevent:
  o errors
  o incorrect results
  o biasing algorithms

        | Sara      | Lis   | Hadrien | Lis
Age     | “27”      | “30”  |         | “30”
Size    | 1.77      | 5.58  | 1.80    | 5.58
Country | “Belgium” | “USA” | “FR”    | “USA”

Tidy Data

        | Sara      | Lis   | Hadrien | Lis
Age     | “27”      | “30”  |         | “30”
Size    | 1.77      | 5.58  | 1.80    | 5.58
Country | “Belgium” | “USA” | “FR”    | “USA”

* The observations (people) are in columns and their features are on rows.
  o The observations should be in rows and the features in columns.

Name    | Age  | Size | Country
Sara    | “27” | 1.77 | “Belgium”
Lis     | “30” | 5.58 | “USA”
Hadrien |      | 1.80 | “FR”
Lis     | “30” | 5.58 | “USA”

Remove Duplicates

Name    | Age  | Size | Country
Sara    | “27” | 1.77 | “Belgium”
Lis     | “30” | 5.58 | “USA”
Hadrien |      | 1.80 | “FR”
Lis     | “30” | 5.58 | “USA”

* Lis appears twice.
  o Clean data should not have duplicates.

Name    | Age  | Size | Country
Sara    | “27” | 1.77 | “Belgium”
Lis     | “30” | 5.58 | “USA”
Hadrien |      | 1.80 | “FR”

Use Unique ID

Name    | Age  | Size | Country
Sara    | “27” | 1.77 | “Belgium”
Lis     | “30” | 5.58 | “USA”
Hadrien |      | 1.80 | “FR”

* The second Lis entry was removed, but it could be a valid entry.
  o Each observation must have a unique ID.

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “Belgium”
1  | Lis     | “30” | 5.58 | “USA”
2  | Hadrien |      | 1.80 | “FR”

Ensure Consistency

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “Belgium”
1  | Lis     | “30” | 5.58 | “USA”
2  | Hadrien |      | 1.80 | “FR”

* Size seems to be in different measurement units.
  o Measurements should use consistent units.

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “Belgium”
1  | Lis     | “30” | 1.70 | “USA”
2  | Hadrien |      | 1.80 | “FR”

Ensure Homogeneity

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “Belgium”
1  | Lis     | “30” | 1.70 | “USA”
2  | Hadrien |      | 1.80 | “FR”

* Two of the countries are abbreviated, one is not.
  o Entries must be homogeneous.

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “BE”
1  | Lis     | “30” | 1.70 | “USA”
2  | Hadrien |      | 1.80 | “FR”

Correct Data Types

ID | Name    | Age  | Size | Country
0  | Sara    | “27” | 1.77 | “BE”
1  | Lis     | “30” | 1.70 | “USA”
2  | Hadrien |      | 1.80 | “FR”

* “Age” is encoded as text.
  o Correct data types must be used.

ID | Name    | Age | Size | Country
0  | Sara    | 27  | 1.77 | “BE”
1  | Lis     | 30  | 1.70 | “USA”
2  | Hadrien |     | 1.80 | “FR”
Correct Missing Values

ID  Name     Age  Size  Country
0   Sara     27   1.77  “BE”
1   Lis      30   1.70  “USA”
2   Hadrien       1.80  “FR”

* Hadrien’s Age is missing.

Reasons for Missing Values
* data entry
* error
* valid missing value

Solutions
* impute
* drop

ID  Name     Age  Size  Country
0   Sara     27   1.77  “BE”
1   Lis      30   1.70  “USA”
2   Hadrien  28   1.80  “FR”

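The two solutions for missing values, impute and drop, can be sketched as follows. Mean imputation is an assumed choice made for illustration; it yields 28.5 for Hadrien, whereas the slide shows 28, so the slide presumably used a different rule or rounding.

```python
from statistics import mean

rows = [
    {"ID": 0, "Name": "Sara", "Age": 27},
    {"ID": 1, "Name": "Lis", "Age": 30},
    {"ID": 2, "Name": "Hadrien", "Age": None},
]


def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = mean(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_missing(rows, column):
    """Remove every row whose value in `column` is missing."""
    return [r for r in rows if r[column] is not None]


imputed = impute_mean(rows, "Age")
dropped = drop_missing(rows, "Age")
print(imputed[2]["Age"], len(dropped))
```

Imputing keeps all observations at the cost of an invented value; dropping keeps only observed values at the cost of a smaller dataset.
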
EXPLORATORY DATA ANALYSIS

[Figure: Data Science workflow, with the Exploration & Visualization stage following Data Preparation]

Anscombe’s Quartet

  Dataset 1       Dataset 2       Dataset 3       Dataset 4
  x      y        x      y        x      y        x      y
 10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
  8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
 13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
  9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
 11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
 14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
  6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
  4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
 12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
  7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
  5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89

For each dataset:
N = 11      mean(x) = 9     mean(y) = 7.5
r = 0.82    sd(x) = 3.32    sd(y) = 2.03

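The shared summary statistics can be checked directly. The sketch below uses the published Anscombe values and only the Python standard library; `pearson_r` is a small helper written for this example, and `stdev` is the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Dataset 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Dataset 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Dataset 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Dataset 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (mean x, mean y, sd x, sd y, r), each rounded to two decimals
summary = {
    name: (
        round(mean(xs), 2),
        round(mean(ys), 2),
        round(stdev(xs), 2),
        round(stdev(ys), 2),
        round(pearson_r(xs, ys), 2),
    )
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, stats)
```

All four datasets produce the same rounded summary, which is why the quartet is used to argue for plotting data before trusting summary statistics.
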
Illustrative Example. SpaceX Launches Dataset

1. Data Preview

Date        Time (UTC)  Orbit
2010-06-04  18:45:00    LEO
2010-12-08  15:43:00    LEO (ISS)
2012-05-22  7:44:00     LEO (ISS)
2012-10-08  0:35:00     LEO (ISS)
2013-03-01  15:10:00    LEO (ISS)

Other values shown in the preview:
* Customer: NASA (COTS) NRO, NASA (COTS), NASA (CRS)
* Launch Site: CCAFS
* Payload: Dragon Spacecraft Qualification Unit, Dragon demo flight C2+, SpaceX CRS-2
* Landing Outcome: Failure (parachute), No attempt

Columns
* Flight Number (number)
* Date (datetime)
* Time (datetime)
* Booster Version (text)
* Launch Site (text)
* Payload (text)
* Payload Mass (kg) (number)
* Orbit (text)
* Customer (text)
* Mission Outcome (text)
* Landing Outcome (text)

2. Descriptive Statistics

* 55 launches
* 2 missing values in the Payload Mass column
* 1 failed mission

[Table: summary statistics (count, unique, top, freq) per column; for example, the most frequent launch site is CCAFS LC-40 and the most frequent customer is NASA (CRS)]

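The count, unique, top and freq statistics for a text column, and the number of missing values in a column, can be computed with the standard library. The rows below are hypothetical sample values chosen for illustration, not the real 55-launch dataset, and the helper names are made up for this sketch.

```python
from collections import Counter

# Hypothetical sample rows; values are illustrative, not the real dataset.
launches = [
    {"Launch Site": "CCAFS LC-40", "Payload Mass (kg)": 500},
    {"Launch Site": "CCAFS LC-40", "Payload Mass (kg)": None},
    {"Launch Site": "KSC LC-39A", "Payload Mass (kg)": 2500},
    {"Launch Site": "CCAFS LC-40", "Payload Mass (kg)": 677},
]


def describe_text(rows, column):
    """Return count, unique, top (most frequent value) and freq for a column."""
    values = [r[column] for r in rows if r[column] is not None]
    top, freq = Counter(values).most_common(1)[0]
    return {"count": len(values), "unique": len(set(values)), "top": top, "freq": freq}


def count_missing(rows, column):
    """Number of rows with no value in `column`."""
    return sum(1 for r in rows if r[column] is None)


site_stats = describe_text(launches, "Launch Site")
missing_mass = count_missing(launches, "Payload Mass (kg)")
print(site_stats, missing_mass)
```
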
3. Visualization

Launches by year and site
* No launch in 2011
* Gradual increase in launches before doubling in 2017
* Prior to 2017, launches originated from Cape Canaveral Air Force Station (CCAFS)
* In 2017, most rockets launched from Kennedy Space Center (KSC)

[Figure: launches by year, colored by Launch Site]

[Figure: payload mass distribution]

[Figure: launches by year and mission outcome. Legend (Mission Outcome): Failure (in flight), Success, Success (payload status unclear)]

Notes on Plot Preparation
* Use color purposefully
* Use color palettes that are distinguishable, even by color-blind people
* Use readable fonts (sans serif)
* Use titles, axis labels and legends

[Figures: example charts titled “2012 Presidential Run” and “How Concerned Are You About the Zika Virus?”]

DASHBOARDS
* group all the relevant information in one place to make it easier to gather insights and act on them

[Figure: example “Sales Summary” dashboard]

Experimentation and Prediction

Experimental Procedure

[Figure: Data Science workflow, with the Experimentation & Prediction stage following Data Preparation]

Case Study: Which is the better blog title?

Title A: Become an expert Data Scientist with this one weird trick!
Title B: You won't believe these tips for [...] Data Scientist!

1. Form a question: Does blog title A or blog title B result in more clicks?
2. Form a hypothesis: Blog title A and title B result in the same number of clicks.
3. Collect data: 50% of users will see blog title A, 50% title B. Track the click-through rate until the sample size is reached.
4. Test the hypothesis: Is the difference in the titles’ click-through rates significant?
5. Interpret results: Choose a title, or design another experiment.

A/B TESTING
* also called Champion / Challenger Testing
* used to make a choice between two options
* metric tracked in the case study: click-through rate

Calculate sample size
* Larger sample sizes allow us to detect smaller changes.
* The experiment is run until the sample size is reached.

[Figure: users split between title A and title B]

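To make the link between sample size and detectable change concrete, here is a minimal sketch using a common textbook approximation for comparing two proportions. The significance level (0.05), the power (0.80) and the example click-through rates are assumptions chosen for illustration; the slides do not specify them.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per title for a two-sided two-proportion test.

    alpha and power are assumed conventional defaults, not values from the slides.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_baseline + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_variant - p_baseline) ** 2)


# Hypothetical click-through rates: detecting 10% -> 11% needs far more
# users than detecting 10% -> 15%.
small_change = sample_size_per_group(0.10, 0.11)
large_change = sample_size_per_group(0.10, 0.15)
print(small_change, large_change)
```
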
Check for significance
* If the difference is significant, we can be reasonably sure that the difference is not due to random chance, but to an actual difference in preference.

[Figure: % who click, title A versus title B]

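One common way to test whether two click-through rates differ significantly is a two-proportion z-test. The sketch below is an assumed choice of test (the slides do not name one), and the click counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the hypothesis that both titles perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: title A gets 100 clicks in 1000 views, title B gets 150.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05:  # 0.05 is an assumed significance threshold
    print("The difference is unlikely to be due to random chance.")
```
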
PREDICTIVE MODELING
* By modeling a process, we can enter new inputs and see what outcome the model predicts.

[Figure: inputs enter the Predictive Model, which produces an output]

Case Study: Forecasting Time Series

Pea Prices in Rwanda
* Historical data for price of peas in Rwandan Francs from 2011 to 2016
* Seasonality: prices are lowest around December and January and peak around August; some years show a second peak around April
* General increase in pea prices annually

Pea price forecast
* The blue line depicts the forecast.
* Confidence Interval: the model is X% sure that the true value will fall in this area.
* The seasonality remains, and the model anticipates a continued increase in pea prices, seen in the higher peaks and lows.

[Figure: pea price forecast with confidence interval; y-axis: Price (RWF)]

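A forecast that keeps the seasonal pattern and continues the upward trend can be sketched with a "seasonal naive with drift" rule: repeat the value from one season earlier and add the average season-over-season change. This is a deliberately simple stand-in, not the model behind the slide's forecast, it produces no confidence interval, and the price series is synthetic.

```python
def seasonal_naive_with_drift(series, season_length, horizon):
    """Forecast `horizon` steps: last season's value plus the average seasonal change."""
    if len(series) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")
    # Average change between each observation and the one a season earlier.
    diffs = [series[t] - series[t - season_length]
             for t in range(season_length, len(series))]
    drift_per_season = sum(diffs) / len(diffs)
    history = list(series)
    for _ in range(horizon):
        history.append(history[-season_length] + drift_per_season)
    return history[len(series):]


# Synthetic quarterly prices: a repeating seasonal pattern on a rising trend.
season = [0, 5, 10, 5]
prices = [100 + 2 * t + season[t % 4] for t in range(12)]
forecast = seasonal_naive_with_drift(prices, season_length=4, horizon=4)
print(forecast)
```

For monthly data such as the pea prices, `season_length` would be 12.
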
SUPERVISED MACHINE LEARNING
* uses existing structured data to make predictions
* used for recommendation systems, diagnosing biomedical images, recognizing hand-written digits, and predicting customer churn

Case Study: Predicting Customer Churn
* Possible predictions for a customer: likely to cancel the subscription (churn) or likely to stay subscribed

1. Model Preparation
* Training data: customers
* Labels: customer outcomes (churn or subscribe)

2. Model Use

[Figure: using the trained model]

3. Model Evaluation
* Split the data into training data and test data.
* Train the model on the training data to obtain the trained model.
* Evaluate the model on the test data: compare the true labels with the model predictions to obtain the model accuracy.

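The split, train and evaluate steps can be sketched without any machine learning library. Everything here is hypothetical: the single feature (months since last login), the labelling rule used to generate the data, the 80/20 split and the very simple threshold "model" are assumptions chosen to keep the example short.

```python
import random

# Hypothetical labelled data: (months since last login, churned?).
customers = [(months, months > 3) for months in range(10)] * 5  # 50 rows


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, rows):
    """Share of rows where 'feature > threshold' matches the true label."""
    return sum((x > threshold) == label for x, label in rows) / len(rows)


def train_threshold(train_rows):
    """Pick the threshold with the best accuracy on the training data."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda th: accuracy(th, train_rows))


rng = random.Random(42)  # fixed seed so the split is reproducible
train_rows, test_rows = train_test_split(customers, test_fraction=0.2, rng=rng)
threshold = train_threshold(train_rows)
test_accuracy = accuracy(threshold, test_rows)
print(f"threshold = {threshold}, test accuracy = {test_accuracy:.2f}")
```

The key point is that accuracy is measured on test rows the model never saw during training.
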
CLUSTERING
* used to divide data into clusters

Case Study: Discovering New Species
* Flower colors
* Petal length and width
* Sepal length and width
* Number of petals

[Figure: scatter plot of Flower Observations; axis: Number of petals]

Case Study: Discovering New Species
* The user eventually decides the final number of clusters; domain knowledge is important in deciding this.

[Figures: the same Flower Observations grouped into two clusters, three clusters and four clusters; axis: Number of petals]

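As an illustration of dividing observations into a user-chosen number of clusters, here is a minimal k-means sketch. K-means is an assumed choice (the slides do not name a clustering algorithm), the flower measurements are hypothetical, and the farthest-point initialization is used only to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Group points into k clusters; returns (labels, centroids)."""
    # Deterministic initialization: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical flower observations: (number of petals, petal width in cm).
flowers = [(5, 1.0), (5, 1.2), (6, 1.1), (12, 3.0), (13, 3.2), (12, 2.9)]
labels, centroids = kmeans(flowers, k=2)
print(labels)
```

The number of clusters `k` is an input, which mirrors the point above: the algorithm does not decide how many species there are.
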
Case Study: Global Innovation Network and Analysis (GINA)

BACKGROUND

GINA: group of senior technologists located in centers of excellence (COEs) around the world
Charter: to engage employees across global COEs to drive innovation, research and university partnerships

New Director’s Objectives:
* to improve these activities and provide a mechanism to track and analyze the related information
* to create more robust mechanisms for capturing the results of its informal conversations with other thought leaders within EMC, in academia, or in other organizations

Plan:
* provide a means to share ideas globally and increase knowledge sharing among GINA members who may be separated geographically
* create a data repository containing both structured and unstructured data to accomplish the goals

Goals:
* Store formal and informal data
* Track research from global technologists
* Mine the data for patterns and insights to improve the team’s operations and strategy

DISCOVERY

Team role                                          Filled by
Business User, Project Sponsor, Project Manager    Vice President from the Office of the CTO
Business Intelligence Analyst                      Representatives from IT
Data Engineer, Database Administrator              Representatives from IT
Data Scientist                                     Distinguished Engineer

Data Categories (SD: structured data, UD: unstructured data)

Innovation Roadmap
* 5 years of idea submissions from internal innovation contests
* SD: idea counts, submission dates, inventor names
* UD: textual description of the ideas

Minutes
* notes representing innovation and research activities from around the world
* SD: dates, names and geographic locations
* UD: rich data about knowledge growth and transfer

INITIAL HYPOTHESES

A. Descriptive analytics of what is currently happening can spark further creativity, collaboration and asset generation.
B. Predictive analytics can advise executive management of where it should be investing in the future.

1. Innovation activity in different geographic regions can be mapped to corporate strategic directions.
2. The length of time it takes to deliver ideas decreases when global knowledge transfer occurs as part of the idea delivery process.
3. Innovators who participate in global knowledge transfer deliver ideas more quickly than those who do not.
4. An idea submission can be analyzed and evaluated for the likelihood of receiving funding.
5. Knowledge discovery and growth for a particular topic can be measured and compared across geographic regions.
6. Knowledge transfer activity can identify research-specific boundary spanners in disparate regions.
7. Strategic corporate themes can be mapped to geographic regions.
8. Frequent knowledge expansion and transfer events reduce the time it takes to generate a corporate asset from an idea.
9. Lineage maps can reveal when knowledge expansion and transfer did not result (or has not yet resulted) in a corporate asset.
10. Emerging research topics can be classified and mapped to specific ideators, innovators, boundary spanners and assets.

DATA PREPARATION

Many of the names of the researchers and people interacting with the universities were misspelled or had leading and trailing spaces in the datastore.
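
A minimal sketch of this kind of name cleanup is shown below: whitespace is stripped and collapsed, then the result is matched against a reference list with `difflib.get_close_matches` from the standard library. The reference list, the example names and the 0.85 similarity cutoff are hypothetical; the case study does not describe how the names were actually corrected.

```python
from difflib import get_close_matches

# Hypothetical reference list of correctly spelled names.
KNOWN_RESEARCHERS = ["Grace Hopper", "Ada Lovelace"]


def clean_name(raw, known=KNOWN_RESEARCHERS, cutoff=0.85):
    """Trim and collapse whitespace, then snap near-misses to a known spelling."""
    name = " ".join(raw.split())
    matches = get_close_matches(name, known, n=1, cutoff=cutoff)
    # Names with no close match are kept as typed (only whitespace is fixed).
    return matches[0] if matches else name


print(clean_name("  Grace Hoper "))   # misspelled, with stray spaces
print(clean_name("Ada Lovelace  "))   # trailing spaces only
print(clean_name(" Alan Turing"))     # not in the reference list
```

A high cutoff keeps the matching conservative, so that two different people with similar names are less likely to be merged.
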





MODEL PLANNING

* Social network analysis techniques
* Initiate a longitudinal study to begin tracking data points over time regarding people developing new intellectual property

Considerations for the Parameters of the Longitudinal Study
* Identify the right milestones to achieve this goal.
* Trace how people move ideas from each milestone toward the goal.
* Trace ideas that die, and trace others that reach the goal. Compare the journeys of ideas that make it and those that do not.
* Compare the times and the outcomes using a few different methods (depending on how the data is collected and assembled).

MODEL BUILDING

* Natural Language Processing (NLP) for the textual descriptions in the Innovation Roadmap ideas
* Social network analysis using R and RStudio

[Figure: social graph showing submitters and finalists]

[Figure: social graph of innovators]

* Each color represents an innovator from a different country. The large dots with red circles around them represent hubs (persons with high connectivity and a high “betweenness” score).
* One person has an unusually high score. A query on him yielded information on his attendance at conferences (at different locations) and interactions with other scientists.
* The finding suggests that at least part of the initial hypothesis is correct: the data can identify innovators who span different geographies and business units.

Betweenness Ranks
1. 578
2. 511
3. 341
4. 171
5. 138

Software / Database Used:
* Tableau: for data visualization and exploration
* Pivotal Greenplum Database: main data repository

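The “betweenness” score counts how often a person lies on the shortest paths between other people, which is why high scorers act as hubs or boundary spanners. The case study computed it with R; the sketch below is an independent plain-Python version of Brandes' algorithm for unweighted, undirected graphs, run on a small hypothetical network.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an undirected, unweighted graph.

    `graph` maps each node to a list of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each pair of endpoints was counted once from each side.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical network: "hub" connects three people who do not know each other.
network = {
    "hub": ["ana", "ben", "chi"],
    "ana": ["hub"],
    "ben": ["hub"],
    "chi": ["hub"],
}
scores = betweenness(network)
print(scores)
```
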
RESULTS

The project was considered successful in identifying boundary spanners and hidden innovators. As a result, the CTO office launched longitudinal studies to begin data collection efforts and track innovation results over longer periods of time.

Applying social network analysis enabled the team to find a pocket of people within EMC who were making disproportionately strong contributions. These findings were shared internally through presentations and conferences and promoted through social media and blogs.

OPERATIONALIZATION

Key Findings:
* The CTO office and GINA need more data in the future, including a marketing initiative to convince people to inform the global community about their innovation and research activities.
* Some of the data is sensitive, and the team needs to consider security and privacy related to the data, such as who can run the models and see the results.
* In addition to running models, a parallel initiative needs to be created to improve basic Business Intelligence activities, such as dashboards, reporting, and queries on research activities worldwide.
* A mechanism is needed to continually reevaluate the model after deployment.

SUMMARY

Component: Result/s
* Discovery, Business Problem Framed: Tracking global knowledge growth, ensuring effective knowledge transfer, and quickly converting it into corporate assets. Executing on these three elements should accelerate innovation.
* Initial Hypothesis: An increase in geographic knowledge transfer improves the speed of idea delivery.
* Data: Five years of innovation idea submissions and history; six months of textual notes from global innovation and research activities
* Model Planning, Analytic Technique: Social network analysis, social graphs, clustering, and regression analysis
* Result and Key Findings:
  1. Identified hidden, high-value innovators and found ways to share their knowledge
  2. Informed investment decisions in university research projects
  3. Created tools to help submitters improve ideas with idea recommender systems

Outline

Module 1
INTRODUCTION TO DATA AND DATA SCIENCE

* Data and the Data Ecosystem
* Data Science and its Applications
* Data Science Roles and Tasks
* The Data Scientist
* The Data Science Workflow
* Data Collection and Storage
* Preparation, Exploration and Visualization
* Experimentation and Prediction
* Case Study: Global Innovation Network and Analysis

Learning Outcomes

1. Define data, its properties, importance and capabilities
2. Explain the drivers of data and the current data ecosystem
3. Define Data Science and differentiate its applications
4. Differentiate the Data Science roles and enumerate the tools needed by each role
5. Explain the skills and characteristics that a Data Scientist must possess
6. Explain the phases of the Analytics lifecycle and relate these to the Data Science workflow
7. Identify, enumerate and explain the elements of the Data Science workflow